Atomicity: a multi-pronged approach

ABSTRACT

In a multiprocessor system with speculative execution, atomicity can be approached in several fashions. One approach is to have atomic instructions that achieve multiple functions and are guaranteed to complete. Another approach is to have blocks of code that are grouped to succeed or fail together. A system can incorporate more than one such approach. In implementing more than one approach, the system may prioritize one over another. When conflict detection is done through a directory lookup in cache memory, atomic instructions and atomicity related operations may be implemented in a cache data array access pipeline in that cache memory. This implementation may include feedback to the pipeline for implementing multiple functions within an atomic instruction and also for cascading atomic instructions.

CROSS REFERENCE TO RELATED APPLICATIONS

Benefit is claimed of the following applications, in particular:

61/295,669, filed Jan. 15, 2010 and

61/299,911 filed Jan. 29, 2010,

both for “SPECULATION AND TRANSACTION IN A SYSTEM SPECULATION AND TRANSACTION SUPPORT IN L2 L1 SUPPORT FOR SPECULATION/TRANSACTIONS IN A2 PHYSICAL ALIASING FOR THREAD LEVEL SPECULATION MULTIFUNCTIONING L2 CACHE CACHING MOST RECENT DIRECTORY LOOK UP AND PARTIAL CACHE LINE SPECULATION SUPPORT”,

And of the following applications in general:

Benefit of the following applications is claimed and they are also incorporated by reference: U.S. patent application Ser. No. 12/796,411 filed Jun. 8, 2010; U.S. patent application Ser. No. 12/696,780, filed Jan. 29, 2010; U.S. provisional patent application Ser. No. 61/293,611, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,367, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,172, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,190, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,496, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,429, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/697,799 filed Feb. 1, 2010; U.S. patent application Ser. No. 12/684,738, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,860, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,174, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,184, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,852, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,642, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,804, filed Jan. 8, 2010; U.S. provisional patent application Ser. No. 61/293,237, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/693,972, filed Jan. 26, 2010; U.S. patent application Ser. No. 12/688,747, filed Jan. 15, 2010; U.S. patent application Ser. No. 12/688,773, filed Jan. 15, 2010; U.S. patent application Ser. No. 12/684,776, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/696,825, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/684,693, filed Jan. 8, 2010; U.S. provisional patent application Ser. No. 61/293,494, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/731,796, filed Mar. 25, 2010; U.S. patent application Ser. No. 12/696,746, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/697,015 filed Jan. 29, 2010; U.S. patent application Ser. No. 12/727,967, filed Mar. 19, 2010; U.S. patent application Ser. No. 12/727,984, filed Mar. 19, 2010; U.S. patent application Ser. No. 12/697,043 filed Jan. 29, 2010; U.S. patent application Ser. No. 12/697,175, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/684,287, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,630, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/723,277 filed Mar. 12, 2010; U.S. patent application Ser. No. 12/696,764, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/696,817 filed Jan. 29, 2010; U.S. patent application Ser. No. 12/697,164, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/796,411, filed Jun. 8, 2010; and, U.S. patent application Ser. No. 12/796,389, filed Jun. 8, 2010.

All of the above-listed applications are incorporated by reference herein.

GOVERNMENT CONTRACT

This invention was made with Government support under Contract No.: B554331 awarded by the Department of Energy. The Government has certain rights in this invention.

BACKGROUND

The invention relates to guaranteeing atomicity in a multiprocessor environment.

In a multiprocessor environment, it is desirable to provide some memory operations that provide atomicity for a sequence of accesses.

In some multiprocessor environments there may be more than one type of memory access construct that provides atomicity. For instance, in the PowerPC Architecture, there is a pair of instructions called larx/stcx that requests atomicity for a load/store access pair to the same memory location. Hardware Transactional Memory is another method for providing atomicity for a larger sequence of memory accesses.

The field of the invention is speculative execution in a multiprocessor system, and in particular guaranteeing atomicity in such a system.

The following definition of atomicity appears in IBM®, Power ISA™ Version 2.06, Jan. 30, 2009, “1.4 Single-copy Atomicity,” p. 655:

“An access is single-copy atomic, or simply atomic, if it is always performed in its entirety with no visible fragmentation. Atomic accesses are thus serialized: each happens in its entirety in some order, even when that order is not specified in the program or enforced between processors . . . . An access that is not atomic is performed as a set of smaller disjoint atomic accesses.”

This book, which will be referred to herein as “PowerPC Architecture,” is incorporated by reference in its entirety.

In the PowerPC architecture, atomicity is often thought of in the context of the larx/stcx type instructions. This type of instruction has several forms:

lwarx/stwcx, for word accesses;

lbarx/stbcx for byte accesses,

lharx/sthcx for halfword accesses, and

ldarx/stdcx for double word accesses.

These instructions come in pairs that delimit a block of instructions that the programmer desires to have complete atomically. If the stcx instruction indicates a failure of atomicity, then the whole block fails. More about an implementation of larx/stcx appears in co-pending application Ser. No. 12/697,799 filed Jan. 29, 2010, which is incorporated herein. This co-pending application is not conceded to be prior art.

With new multiprocessing architectures, new mechanisms for guaranteeing atomicity are desirable.

SUMMARY

In the latest IBM® Blue Gene® architecture, the point of coherence is a directory lookup mechanism in a cache memory. It would be desirable to guarantee a hierarchy of atomicity options within that architecture.

In one embodiment, a multiprocessor system includes a plurality of processors, a conflict checking mechanism, and an instruction implementation mechanism. The processors are adapted to carry out speculative execution in parallel. The conflict checking mechanism is adapted to detect and protect results of speculative execution responsive to memory access requests from the processors. The instruction implementation mechanism cooperates with the processors and the conflict checking mechanism and is adapted to implement an atomic instruction that includes load, modify, and store with respect to a single memory location in an uninterruptible fashion.

In another embodiment, a system includes a plurality of processors and at least one cache memory. The processors are adapted to issue atomicity related operations. The operations include at least one atomic instruction and at least one other type of operation. The atomic instruction includes sub-operations including a read, a modify, and a write. The other type of operation includes at least one atomicity related operation. The cache memory includes a cache data array access pipeline and a controller. The controller is adapted to prevent the other types of operations from entering the cache data array access pipeline, responsive to an atomic instruction in the pipeline, when those other types of operation compete with the atomic instruction in the pipeline for a memory resource.

In yet another embodiment, a multiprocessor system includes a plurality of processors, a central conflict checking mechanism, and a prioritizer. The processors are adapted to implement parallel speculative execution of program threads and to implement a plurality of atomicity related techniques. The central conflict checking mechanism resolves conflicts between the threads. The prioritizer prioritizes at least one atomicity related technique over at least one other atomicity related technique.

In a further embodiment, a computer method includes issuing an atomic instruction, recognizing the atomic instruction, and blocking other operations. The atomic instruction is issued from one of the processors in a multi-processor system and defines sub-operations that include reading, modifying, and storing with respect to a memory resource. A directory based conflict checking mechanism recognizes the atomic instruction. Other operations seeking to access the memory resource are blocked until the atomic instruction has completed.

Objects and advantages will be apparent in the following.

BRIEF DESCRIPTION OF DRAWING

Embodiments will now be described by way of non-limiting example with reference to the following figures:

FIG. 1 shows an overview of a multiprocessor system within which caching improvements may be implemented.

FIG. 1A shows some software running in a distributed fashion on the multiprocessor system.

FIG. 1B shows a timing diagram with respect to TM type speculative execution.

FIG. 2 shows a map of a cache slice.

FIG. 3 is a schematic of the control unit of an L2 slice.

FIG. 3A shows a request queue and retaining data associated with a previous memory access request.

FIG. 3B shows interaction between the directory pipe and directory SRAM.

FIG. 4 shows structure of the directory SRAM 309.

FIG. 5 shows more detail of operation of the L2 central unit.

FIG. 6 shows operation of a cache data array access pipe with respect to atomicity related functions.

FIG. 7 shows interaction between code sections embodying some different approaches to atomicity.

FIG. 8 is a flowchart relating to queuing atomic instructions.

DETAILED DESCRIPTION

The term “thread” is used herein. A thread can be either a hardware thread or a software thread. A hardware thread within a core processor includes a set of registers and logic for executing a software thread. The software thread is a segment of computer program code.

These threads can be the subject of “speculative execution,” meaning that a thread or threads can be started as a sort of wager or gamble, without knowledge of whether the thread can complete successfully. A given thread cannot complete successfully if some other thread modifies the data that the given thread is using in such a way as to invalidate the given thread's results. The terms “speculative,” “speculatively,” “execute,” and “execution” are terms of art in this context. These terms do not imply that any mental step or manual operation is occurring. All operations or steps described herein are to be understood as occurring in an automated fashion under control of computer hardware or software.

If speculation fails, the results must be invalidated and the thread must be re-run or some other workaround found. Generally, recovery from failure of any kind of speculative execution in the current embodiment relates to undoing changes made by a thread. Once a software thread is committed, the actions taken by the thread become irreversible.

Three modes of speculative execution are supported in the current embodiment: Thread Level Speculation (“TLS”), Transactional Memory (“TM”), and Rollback.

TM occurs in response to a specific programmer request. Generally the programmer will put instructions in a program delimiting sections in which TM is desired. This may be done by marking the sections as requiring atomic execution. “An access is single-copy atomic, or simply “atomic”, if it is always performed in its entirety with no visible fragmentation.” IBM® Power ISA™ Version 2.06, Jan. 30, 2009. In a transactional model, the programmer replaces critical sections with transactional sections at 1601 (FIG. 7), which can manipulate shared data without locking. When the section ends, the program will make another call that ultimately signals the hardware to do conflict checking and reporting.

Normally TLS occurs when a programmer has not specifically requested parallel operation. Sometimes a compiler will ask for TLS execution in response to a sequential program. When the programmer writes this sequential program, she may insert commands delimiting sections. The compiler can recognize these sections and attempt to run them in parallel.

Rollback occurs in response to “soft errors.” Normally these errors occur in response to cosmic rays or alpha particles from solder balls. Rollback is discussed in more detail in co-pending application Ser. No. 12/696,780, which is incorporated herein by reference.

The present invention arose in the context of the IBM® Blue Gene® project, which is further described in the applications incorporated by reference above. FIG. 1 is a schematic diagram of an overall architecture of a multiprocessor system in accordance with this project, and in which the invention may be implemented. At 101, there are a plurality of processors operating in parallel along with associated prefetch units and L1 caches. At 102, there is a switch. At 103, there are a plurality of L2 slices. At 104, there is a main memory unit. It is envisioned, for the present embodiment, that the L2 cache should be the point of coherence.

FIG. 1A shows some software running in a distributed fashion, distributed over the cores of node 50. An application program is shown at 131. If the application program requests TLS or TM, a runtime system 132 will be invoked. This runtime system serves particularly to manage TM and TLS execution and can request domains of IDs from the operating system 133. The operating system configures the hardware to define domains and modes of execution. “Domains” in this context are numerical groups of IDs that can be assigned to a mode of speculation. More about this use of domains can be found in the provisional applications 61/295,669, filed Jan. 15, 2010 and 61/299,911 filed Jan. 29, 2010, incorporated by reference above. The runtime system can also be called to request allocation of IDs and to start a speculative section, as well as to end a section and determine the outcome of the speculation. More about a runtime system and about allocation and commitment of IDs can be found in the provisional applications 61/295,669, filed Jan. 15, 2010 and 61/299,911 filed Jan. 29, 2010, incorporated by reference above.

The application program can also request various operation types, for instance as specified in a standard such as the PowerPC architecture. These operation types might include larx/stcx pairs or atomic instructions, to be discussed further below.

FIG. 1B shows a timing diagram explaining how TM execution might work on this system. At 141 the program starts executing. At the end of block 141, a call for TM is made. In 142 the runtime system receives this request and conveys it to the operating system. At 143, the operating system confirms the availability of the mode. The operating system can accept, reject, or put on hold any requests for a mode. The confirmation is made to the runtime system at 144. The confirmation is received at the application program at 145. If there had been a refusal, the program would have had to adopt a different strategy, such as serialization or waiting for modes or domains to become available. Because the request was accepted, parallel sections can start running at the end of 145. The runtime system gets speculative IDs from the hardware at 146 and transmits them to the application program at 147, which then uses them. The program knows when to finish speculation at the end of 147. Then the runtime system asks for the ID to commit at 148. Any conflict information can be transmitted back to the application program at 149, which then may try again or adopt other strategies. If there is a conflict, an interrupt is raised by the L2. The L2 will send the interrupt to the hardware thread that was using the ID. This hardware thread then has to figure out, based on the state the runtime system is in and the state the L2 central provides indicating a conflict, what to do in order to resolve the conflict. For example, it might execute the transactional memory section again, which causes the software to jump back to the start of the transaction.

If the hardware determines that no conflict has occurred, the speculative results of the associated thread can be made persistent.

In response to a conflict, trying again may make sense where another thread completed successfully, which may allow the current thread to succeed. If both threads restart, there can be a “livelock,” where both keep failing over and over. In this case, the runtime system may have to adopt other strategies like getting one thread to wait, choosing one transaction to survive and killing others, or other strategies, all of which are known in the art.

FIG. 2 shows a cache slice. It includes arrays of data storage 201, anda central control portion 202.

FIG. 3 shows features of an embodiment of the control section 202 of a cache slice 103.

Coherence tracking unit 301 issues invalidations, when necessary. These invalidations are issued centrally, while in the prior generation of the Blue Gene® project, invalidations were achieved by snooping.

The request queue 302 buffers incoming read and write requests. In this embodiment, it is 16 entries deep, though other request buffers might have more or fewer entries. The addresses of incoming requests are matched against all pending requests to determine ordering restrictions. The queue presents the requests to the directory pipeline 308 based on ordering requirements.

The write data buffer 303 stores data associated with write requests. This buffer passes the data to the cache data array access pipeline, which is here implemented as eDRAM pipeline 305, in case of a write hit or after a write miss resolution.

The directory pipeline 308 accepts requests from the request queue 302, retrieves the corresponding directory set from the directory SRAM 309, matches and updates the tag information, writes the data back to the SRAM and signals the outcome of the request (hit, miss, conflict detected, etc.).

The L2 implements four parallel eDRAM pipelines 305 that operate independently. They may be referred to as eDRAM bank 0 through eDRAM bank 3. The eDRAM pipeline controls the eDRAM access and the dataflow from and to this macro. If writing only subcomponents of a doubleword or for load-and-increment or store-add operations, it is responsible for scheduling the necessary read-modify-write (RMW) cycles and providing the dataflow for insertion and increment.

The read return buffer 304 buffers read data from eDRAM or the memory controller and is responsible for scheduling the data return using the switch 102. In this embodiment it has a 32B wide data interface to the switch. It is used only as a staging buffer to compensate for backpressure from the switch. It does not serve as a cache.

The miss handler 307 takes over processing of misses determined by the directory. It provides the interface to the DRAM controller and implements a data buffer for write and read return data from the memory controller.

The reservation table 306 registers and invalidates reservation requests.

Per FIG. 3A, the L2 slice 103 includes a request queue 302. At 311, a cascade of modules tests whether pending memory access requests will require data associated with the address of a previous request, the address being stored at 313. These tests might look for memory mapped flags from the L1 or for some other identification. A result of the cascade 311 is used to create a control input at 314 for selection of the next queue entry for lookup at 315, which becomes an input for the directory look up module 312.

FIG. 3B shows more about the interaction between the directory pipe 308 and the directory SRAM 309. The vertical lines in the pipe represent time intervals during which data passes through a cascade of registers in the directory pipe. In a first time interval T1, a read is signaled to the directory SRAM. In a second time interval T2, data is read from the directory SRAM. In a third time interval, T3, a table lookup informs writes WR and WR DATA to the directory SRAM. In general, table lookup will govern the behavior of the directory SRAM to control cache accesses responsive to speculative execution. Only one table lookup is shown at T3, but more might be implemented.

FIG. 4 shows the formats of 4 directory SRAMs included at 309, to wit:

-   a base directory 321;
-   a least recently used directory 322;
-   a COH/dirty directory 323 and 323′; and
-   a speculative reader directory 324.

In the base directory, 321, there are 15 bits that locate the line at 271. Then there is a seven bit speculative writer ID field 272 and a flag 273 that indicates whether the write is speculative. Then there is a two bit speculative read flag field 274 indicating whether to invoke the speculative reader directory 324, and a one bit “current” flag 275. The current flag 275 indicates whether the current line is assembled from more than one way or not. The processor, A2, does not know about the fields 272-275. These fields are set by the L2 directory pipeline.

If the speculative writer flag is set, then the way has been written speculatively, not taken from main memory, and the writer ID field will say what the writer ID was. If the flag is clear, the writer ID field is irrelevant.
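By way of non-limiting illustration, the field layout of the base directory described above can be sketched in software. The field widths (15, 7, 1, 2, and 1 bits) are taken from the text; the packing order, the class name `BaseDirEntry`, and the helper `effective_writer` are assumptions made only for this sketch and are not part of the described hardware.

```python
from dataclasses import dataclass

# Field widths come from the text; the packing order is an assumption.
TAG_BITS = 15        # field 271: locates the line
WRITER_ID_BITS = 7   # field 272: speculative writer ID
SPEC_READ_BITS = 2   # field 274: speculative read flags


@dataclass
class BaseDirEntry:
    tag: int
    writer_id: int
    spec_write: bool   # flag 273
    spec_read: int
    current: bool      # flag 275

    def pack(self) -> int:
        """Pack the fields into one integer, tag in the highest bits."""
        word = self.tag & ((1 << TAG_BITS) - 1)
        word = (word << WRITER_ID_BITS) | (self.writer_id & ((1 << WRITER_ID_BITS) - 1))
        word = (word << 1) | int(self.spec_write)
        word = (word << SPEC_READ_BITS) | (self.spec_read & ((1 << SPEC_READ_BITS) - 1))
        word = (word << 1) | int(self.current)
        return word

    @classmethod
    def unpack(cls, word: int) -> "BaseDirEntry":
        """Inverse of pack()."""
        current = bool(word & 1)
        word >>= 1
        spec_read = word & ((1 << SPEC_READ_BITS) - 1)
        word >>= SPEC_READ_BITS
        spec_write = bool(word & 1)
        word >>= 1
        writer_id = word & ((1 << WRITER_ID_BITS) - 1)
        word >>= WRITER_ID_BITS
        tag = word & ((1 << TAG_BITS) - 1)
        return cls(tag, writer_id, spec_write, spec_read, current)

    def effective_writer(self):
        """The writer ID is meaningful only when the speculative-write flag is set."""
        return self.writer_id if self.spec_write else None
```

Under these assumptions an entry occupies 26 bits, and a reader of the entry consults the writer ID only after checking flag 273.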

The LRU directory indicates “age”, in other words a period of time since a way was used. This directory is for allocating ways in accordance with the Least Recently Used algorithm.

The COH/dirty directory has two uses, and accordingly two possible formats. In the first format, 323, known as “COH,” there are 17 bits, one for each core of the system. This format indicates, when the writer flag is not set, whether the corresponding core has a copy of this line of the cache. In the second format, 323′, there are 16 bits. These bits indicate, if the writer flag is set in the base directory, which part of the line has been modified speculatively. The line has 128 bytes, but they are recorded at 323′ in groups of 8 bytes, so only 16 bits are used, one for each group of eight bytes.
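The arithmetic of the second format can be illustrated with a short sketch: a 128-byte line divided into 8-byte groups gives 128 / 8 = 16 groups, hence 16 bits. The function name `dirty_mask` and the convention that bit 0 covers bytes 0 through 7 are assumptions for illustration; the text does not specify the bit ordering.

```python
LINE_BYTES = 128   # line size stated in the text
GROUP_BYTES = 8    # granularity of the dirty bits stated in the text


def dirty_mask(offset: int, length: int) -> int:
    """Return the 16-bit mask of 8-byte groups touched by a write.

    Assumption: bit g covers bytes 8*g .. 8*g+7 of the line.
    """
    if length <= 0 or offset < 0 or offset + length > LINE_BYTES:
        raise ValueError("write must lie within one 128-byte line")
    first = offset // GROUP_BYTES
    last = (offset + length - 1) // GROUP_BYTES
    mask = 0
    for group in range(first, last + 1):
        mask |= 1 << group
    return mask
```

A write that straddles a group boundary, for example 8 bytes starting at offset 4, marks both groups it touches.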

The operation of the pipe control unit 310 and the EDRAM queue decoupling buffer 300 will be described more below with reference to FIG. 6.

The L2 implements a multitude of decoupling buffers for different purposes, e.g.

-   The request queue is an intelligent decoupling buffer (with reordering logic), allowing requests to be received from the switches even if the directory pipe is blocked.
-   The write data buffer accepts write data from the switch even if the eDRAM pipe is blocked or the target location in the eDRAM is not yet known.
-   The coherence tracking implements two buffers: one decoupling the directory lookup, which sends requests to it, from the internal coherence SRAM lookup pipe, and one decoupling the SRAM lookup results from the interface to the switch.
-   The miss handler implements one from the DRAM controller to the eDRAM and one from the eDRAM to the DRAM controller.
-   There are more: almost every subcomponent that can block for any reason is connected via a decoupling buffer to the unit feeding requests to it.

The L2 caches may operate as set-associative caches while also supporting additional functions, such as memory speculation for Speculative Execution (SE, also referred to as TLS), Transactional Memory and local memory rollback, as well as atomic memory transactions. Support for such functionalities includes additional bookkeeping and storage functionality for multiple versions of the same physical memory line.

To reduce main memory accesses, the L2 cache may serve as the point of coherence for all processors. In performing this function, an L2 central unit will have responsibilities such as defining domains of speculation IDs, assigning modes of speculation execution to domains, allocating speculative IDs to threads, trying to commit the IDs, sending interrupts to the cores in case of conflicts, and retrieving conflict information. This function includes generating L1 invalidations when necessary. Because the L2 caches are inclusive of the L1s, they can remember which processors could possibly have a valid copy of every line, and they can multicast selective invalidations to such processors. The L2 caches are advantageously a synchronization point, so they coordinate synchronization instructions from the PowerPC architecture, such as larx/stcx.

Larx/stcx

The larx and stcx instructions are used to perform a read-modify-write operation to storage. If the store is performed, the use of the larx and stcx instruction pair ensures that no other processor or mechanism has modified the target memory location between the time the larx instruction is executed and the time the stcx instruction completes. The larx instruction loads the word from the location in storage specified by the effective address into a target register. In addition, a reservation on the memory location is created for use by a subsequent stcx instruction. The stcx instruction is used in conjunction with a preceding larx instruction to emulate a read-modify-write operation on a specified memory location.

The L2 caches will handle larx/stcx reservations and ensure their consistency. They are a natural location for this responsibility because software locking is dependent on consistency, which is managed by the L2 caches.

The core basically hands responsibility for larx/stcx consistency and completion off to the external memory system. Unlike the core, it does not maintain an internal reservation and it avoids complex cache management through simple invalidation. Larx is treated like a cache-inhibited load, but invalidates the target line if it hits in the L1 cache. Similarly, stcx is treated as a cache-inhibited store and also invalidates the target line in L1 if it exists.

The L2 cache is expected to maintain reservations for each thread, and no special internal consistency action is taken by the core when multiple threads attempt to use the same lock. To support this, a thread is blocked from issuing any L2 accesses while a larx from that thread is outstanding, and it is blocked completely while a stcx is outstanding. The L2 cache will support larx/stcx as described in the next several paragraphs.

Each L2 slice has 17 reservation registers. Each reservation register consists of a 25-bit address register and a 9-bit thread ID register that identifies which thread has reserved the stored address and indicates whether the register is valid (i.e. in use).

When a larx occurs, the valid reservation thread ID registers are searched to determine if the thread has already made a reservation. If so, the existing reservation is cleared. In parallel, the registers are searched for matching addresses. If a matching address is found, an attempt is made to add the thread ID to the thread identifier of that register. If either no address is found or the thread ID could not be added to reservation registers with matching addresses, a new reservation is established. If a register is available, it is used; otherwise a random existing reservation is evicted and a new reservation is established in its place. The larx continues as an ordinary load and returns data.

Every store searches the valid reservation address registers. All matching registers are simply invalidated. The necessary back-invalidations to cores will be generated by the normal coherence mechanism.

When a stcx occurs, the valid reservation registers 306 are searched for entries with both a matching address and a matching thread ID. If both of these conditions are met, then the stcx is considered a success. Stcx success is returned to the requesting core and the stcx is converted to an ordinary store (causing the necessary invalidations to other cores by the normal coherence mechanism). If either condition is not met, then the stcx is considered a failure. Stcx fail is returned to the requesting core and the stcx is dropped. In addition, for every stcx any pending reservation for the requesting thread is invalidated.
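The reservation handling described in the preceding paragraphs can be modeled in software as follows. This is a simplified sketch, not the hardware implementation: the class `ReservationTable` and its method names are hypothetical, thread IDs are plain integers, and adding a thread to a register with a matching address is assumed always to succeed (the hardware may refuse, as discussed in the text).

```python
import random


class ReservationTable:
    """Software model of the larx/stcx reservation registers of one L2 slice."""

    def __init__(self, num_registers=17, rng=None):
        self.num_registers = num_registers
        self.entries = []                     # each entry: [address, set of thread IDs]
        self.rng = rng or random.Random(0)    # used only to pick an eviction victim

    def _clear_thread(self, thread):
        # Remove the thread from every register; drop registers left empty.
        for entry in self.entries:
            entry[1].discard(thread)
        self.entries = [e for e in self.entries if e[1]]

    def larx(self, thread, address):
        """Establish a reservation; any earlier reservation of the thread is cleared."""
        self._clear_thread(thread)
        for entry in self.entries:
            if entry[0] == address:
                entry[1].add(thread)          # assumption: adding always succeeds
                return
        if len(self.entries) >= self.num_registers:
            # No free register: evict a random existing reservation.
            self.entries.pop(self.rng.randrange(len(self.entries)))
        self.entries.append([address, {thread}])

    def store(self, address):
        """Every store invalidates all registers with a matching address."""
        self.entries = [e for e in self.entries if e[0] != address]

    def stcx(self, thread, address):
        """Return True on success; success converts the stcx to an ordinary store."""
        success = any(e[0] == address and thread in e[1] for e in self.entries)
        self._clear_thread(thread)            # every stcx drops the thread's reservation
        if success:
            self.store(address)
        return success
```

In this model, when two threads reserve the same address, the first stcx succeeds and, being converted to an ordinary store, invalidates the other thread's reservation, so the second stcx fails.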

To allow more than 17 reservations per slice, the actual thread ID field is encoded by the core ID and a vector of 4 bits, each representing a thread of the indicated core. If a reservation is established, first a check for matching address and core number in any register is made. If a register has both matching address and matching core, the corresponding thread bit is activated. Only if all bits are clear is the entire register assumed invalidated and available for reallocation without eviction.
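The encoding just described may be sketched as follows. The 4-bit thread vector is from the text; the placement of the core ID above the thread vector in `encoded_id`, and the class and method names, are assumptions for illustration only.

```python
THREADS_PER_CORE = 4   # one bit per thread of the indicated core, per the text


class ReservationRegister:
    """One reservation register: an address, a core ID, and a 4-bit thread vector."""

    def __init__(self, address, core):
        self.address = address
        self.core = core
        self.thread_bits = 0

    def add_thread(self, core, thread):
        """Activate the thread bit if the core matches; report whether it was added."""
        if core != self.core or not 0 <= thread < THREADS_PER_CORE:
            return False
        self.thread_bits |= 1 << thread
        return True

    def remove_thread(self, core, thread):
        if core == self.core:
            self.thread_bits &= ~(1 << thread)

    @property
    def valid(self):
        # The register is free for reallocation only when all thread bits are clear.
        return self.thread_bits != 0

    def encoded_id(self):
        """Assumed layout: core ID in the upper bits, thread vector in the low 4 bits."""
        return (self.core << THREADS_PER_CORE) | self.thread_bits
```

With this encoding a single register can hold reservations of up to four threads of one core on the same address, while a thread of a different core needs a register of its own.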

Atomic Operations

The L2 supports multiple atomic instructions or operations on 8B entities. These operations are sometimes of the type that perform read, modify, and write back atomically; in other words, they combine several frequently used instructions and guarantee that they can perform successfully. The operation is selected based on address bits as defined in the memory map and the type of access. These operations will typically require read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR) checking. The directory lookup phase will be somewhat different from other instructions, because both read and write are contemplated.

FIG. 6 shows aspects of the L2 cache data array access pipeline, implemented as EDRAM pipeline 305 in the preferred embodiment, pertinent to atomic operations. In this pipeline, data is typically ready after five cycles. At 461, some read data is ready. Error correcting codes (ECC) are used to make sure that the read data is error free. Then read data can be sent to the core at 463. If it is one of these read/modify/write atomic operations or instructions, the data modification is performed at 462, followed by a write back to eDRAM at 465, which feeds back to the beginning of the pipeline per 464, while other matching requests are blocked from the pipeline, guaranteeing atomicity. Sometimes, two such compound instructions will be carried out sequentially, in other words cascaded. In such a case, any number of them can be linked using a feedback at 466. To assemble a line, several iterations of this pipeline structure may be undertaken. More about assembling lines can be found in the provisional applications incorporated by reference above. Thus atomic instructions, which reserve the EDRAM pipeline, can achieve performance results that a sequence of operations cannot while guaranteeing atomicity.

It is possible to feed two atomic operations or instructions to two different addresses together through the EDRAM pipe: read a, read b, then write a and b.
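The two-pass behavior of a read/modify/write atomic operation, with competing requests held back until the write-back has completed, can be modeled as below. This is a behavioral sketch only: `EdramPipeModel`, `begin_atomic`, `finish_atomic`, and `load_increment` are hypothetical names, pipeline timing is not modeled, and blocking is tracked per address for simplicity.

```python
from collections import deque


class EdramPipeModel:
    """Behavioral model: an atomic operation blocks matching requests until written back."""

    def __init__(self):
        self.memory = {}
        self.locked = set()       # addresses held by an in-flight atomic operation
        self.deferred = deque()   # stores held back while their address is locked

    def begin_atomic(self, address, modify):
        """First pass: read the old value and compute the modified value."""
        if address in self.locked:
            raise RuntimeError("address already held by an atomic operation")
        self.locked.add(address)
        old = self.memory.get(address, 0)
        return old, modify(old)

    def finish_atomic(self, address, new_value):
        """Second pass (feedback): write back, then release held-back stores."""
        self.memory[address] = new_value
        self.locked.discard(address)
        pending, self.deferred = self.deferred, deque()
        for addr, value in pending:
            self.store(addr, value)

    def store(self, address, value):
        """Ordinary store; returns False if it had to be held back."""
        if address in self.locked:
            self.deferred.append((address, value))
            return False
        self.memory[address] = value
        return True


def load_increment(pipe, address):
    """Return the old value and leave old + 1 in memory, as one atomic operation."""
    old, new = pipe.begin_atomic(address, lambda v: v + 1)
    pipe.finish_atomic(address, new)
    return old
```

A store that arrives between the two passes is not lost; in this model it is applied after the write-back, so the atomic operation's read and write are never separated by another access to the same address.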

FIG. 7 shows a comparison between approaches to atomicity. At 1601 a thread executing pursuant to a TM model is shown. At 1602 a block of code protected by a larx/stcx pair is shown. At 1603 an atomic operation is shown.

Thread 1601 includes three parts,

a first part 1604 that involves at least one load instruction;

a second part 1605 that involves at least one store instruction; and

a third part 1606 where the system tries to commit the thread.

Arrow 1607 indicates that the reader set directory is active for thatpart. Arrow 1608 indicates that the writer set directory is active forthat part.

Code block 1602 is delimited by a larx instruction 1609 and a stcxinstruction 1610. Arrow 1611 indicates that the reservation table 306 isactive. When the stcx instruction executes, if there has been any reador write conflict, the whole block 1602 fails.

Atomic operation 1603 is one of the types indicated in the table below, for instance “load increment.” The arrows at 1612 show the arrival of the atomic operation during the periods of time delimited by double arrows at 1607 and 1611. The atomic operation is guaranteed to complete due to the block on the EDRAM pipe for the relevant memory accesses. Accordingly, if there is a concurrent use by a TM thread 1601 and/or by a block of code protected by LARX/STCX 1602, and if those uses access the same memory location as the atomic operation 1603, a conflict will be signaled and results of the code blocks 1601 and 1602 will be invalidated. An uninterruptible, persistent atomic instruction will be given priority over a reversible operation, e.g., a TM transaction, or an interruptible operation, e.g., a LARX/STCX pair.

As between blocks 1601 and 1602, which is successful and which is invalidated will depend on the order of operations, if they compete for the same memory resource. For instance, in the absence of 1603, if the stcx instruction 1610 completes before the commit attempt 1606, the larx/stcx block will succeed while the TM thread will fail. Alternatively, also in the absence of 1603, if the commit attempt 1606 completes before the stcx instruction 1610, then the larx/stcx block will fail. The TM thread can actually function a bit like multiple larx/stcx pairs together.
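The priority of an atomic operation over a larx/stcx block can be illustrated with a small sequential model. This is only a sketch under stated assumptions: the single-entry `reservation_t` stands in for reservation table 306, all `model_*` names are hypothetical, only write conflicts are modeled, and nothing here reflects the actual hardware structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-entry reservation, standing in for reservation table 306. */
typedef struct {
    const uint64_t *addr;   /* reserved location, NULL when nothing is reserved */
} reservation_t;

/* larx: load the value and record a reservation on its address. */
static uint64_t model_larx(reservation_t *r, const uint64_t *addr) {
    assert(addr != NULL);
    r->addr = addr;
    return *addr;
}

/* A conflicting write to the reserved address clears the reservation. */
static void model_signal_conflict(reservation_t *r, const uint64_t *addr) {
    if (r->addr == addr) {
        r->addr = NULL;
    }
}

/* Atomic load-increment: always completes and signals a conflict to
 * any reservation held on the same location. */
static uint64_t model_load_increment(reservation_t *r, uint64_t *addr) {
    uint64_t old = *addr;
    *addr = old + 1;
    model_signal_conflict(r, addr);
    return old;
}

/* stcx: the store takes effect only if the reservation is still held. */
static bool model_stcx(reservation_t *r, uint64_t *addr, uint64_t value) {
    if (r->addr != addr) {
        return false;       /* the whole larx/stcx block fails */
    }
    *addr = value;
    r->addr = NULL;
    return true;
}
```

In this model, a `model_load_increment` arriving between `model_larx` and `model_stcx` on the same location makes the `model_stcx` fail, while the increment itself is never undone.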

FIG. 8 shows some issues relating to queuing operations. At 1701, an atomic instruction issues from a processor. It takes the form of a memory access with the lower bits indicating an address of a memory location and the upper bits indicating which operation is desired. At 1702, the L1D and L1P treat this operation as an ordinary memory access to an address that is not cached. At 1703, in the pipe control unit of the L2 cache slice, the operation is recognized as an atomic instruction responsive to a directory lookup. The directory lookup also determines whether there are multiple versions of the data accessed by the atomic instruction. At 1704, if there are multiple versions, control is transferred to the miss handler.
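The idea of carrying the operation in the upper address bits can be sketched as follows. The text does not give bit positions, so the shift, the 3-bit opcode field, and the marker bit `ATOMIC_FLAG` are all assumptions chosen for illustration, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: field positions are illustrative only; the description
 * says just that lower bits hold the address and upper bits select
 * the operation. */
#define ATOMIC_OP_SHIFT  40u
#define ATOMIC_OP_MASK   0x7ull
#define ATOMIC_FLAG      (1ull << 43)   /* hypothetical "atomic access" marker */
#define ATOMIC_ADDR_MASK ((1ull << ATOMIC_OP_SHIFT) - 1)

/* Build the address a core would issue for an atomic access. */
static uint64_t atomic_encode(uint64_t addr, unsigned opcode) {
    assert(opcode <= ATOMIC_OP_MASK);
    return ATOMIC_FLAG
         | ((uint64_t)opcode << ATOMIC_OP_SHIFT)
         | (addr & ATOMIC_ADDR_MASK);
}

/* Decoding as it might happen at the L2 directory lookup. */
static int atomic_is_atomic(uint64_t a) {
    return (a & ATOMIC_FLAG) != 0;
}

static unsigned atomic_opcode(uint64_t a) {
    return (unsigned)((a >> ATOMIC_OP_SHIFT) & ATOMIC_OP_MASK);
}

static uint64_t atomic_target(uint64_t a) {
    return a & ATOMIC_ADDR_MASK;
}
```

Because the encoded value is just an address, the L1 levels can pass it through as an uncached access, and only the L2 needs to interpret the upper bits.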

At 1705, the miss handler treats the existence of multiple versions as a cache miss. It blocks further accesses to that set and prevents them from entering the queue, by directing them to the EDRAM decoupling buffer. With respect to the set, the EDRAM pipe is then made to carry out copy/insert operations at 1707 until the aggregation is complete at 1708. This version aggregation loop is used for ordinary memory accesses to cache lines that have multiple versions.

Once the aggregation is complete, or if there are not multiple versions, control passes to 1710 where the current access is inserted into the EDRAM queue. If there is already an atomic instruction relating to this line of the cache at 1711, then, at 1711, the current operation must wait in the EDRAM decoupling buffer. Non-atomic operations or instructions will similarly have to be decoupled if they seek to access a cache line that is currently being accessed by an atomic instruction in the EDRAM queue. If there are no atomic instructions relating to this line in the queue, then control passes to 1713 where the current operation is transferred to the EDRAM queue. Then, at 1714, the atomic instruction traverses the EDRAM queue twice, once for the read and modify and once for the write. During this traversal, other operations seeking to access the same line may not enter the EDRAM pipe, and will be decoupled into the decoupling buffer.
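The admission rule described above can be sketched as a small bookkeeping model. The structure `edram_model_t`, the fixed capacity `MAX_INFLIGHT`, and the function names are hypothetical; the sketch tracks only which cache lines have an atomic instruction in flight and counts requests diverted to the decoupling buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 8   /* ASSUMPTION: arbitrary capacity for the sketch */

typedef struct {
    uint64_t atomic_lines[MAX_INFLIGHT]; /* lines with an atomic in the queue */
    size_t   n_atomic;
    size_t   n_decoupled;                /* requests sent to the decoupling buffer */
} edram_model_t;

typedef enum { ADMITTED, DECOUPLED } admit_t;

static bool line_has_atomic(const edram_model_t *m, uint64_t line) {
    for (size_t i = 0; i < m->n_atomic; i++) {
        if (m->atomic_lines[i] == line) {
            return true;
        }
    }
    return false;
}

/* Submit a request: it is decoupled if an atomic instruction already
 * holds the same line (or if the sketch's tracking array is full). */
static admit_t model_submit(edram_model_t *m, uint64_t line, bool is_atomic) {
    assert(m->n_atomic <= MAX_INFLIGHT);
    if (line_has_atomic(m, line) || (is_atomic && m->n_atomic == MAX_INFLIGHT)) {
        m->n_decoupled++;
        return DECOUPLED;
    }
    if (is_atomic) {
        m->atomic_lines[m->n_atomic] = line;
        m->n_atomic++;
    }
    return ADMITTED;
}

/* Called after the atomic's second traversal (the write) completes. */
static void model_atomic_done(edram_model_t *m, uint64_t line) {
    for (size_t i = 0; i < m->n_atomic; i++) {
        if (m->atomic_lines[i] == line) {
            m->n_atomic--;
            m->atomic_lines[i] = m->atomic_lines[m->n_atomic];
            return;
        }
    }
}
```

Requests to other lines are unaffected, which matches the description that only operations seeking the same line are held back.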

The following atomic instructions are examples that are supported in the present embodiment, though others might be implemented. These operations are implemented in addition to the memory mapped I/O operations in the PowerPC architecture.

Each entry below lists: Opcode, Load/Store, Operation, then Function and Comment.

Opcode 000, Load: Load
Function: Load the current value.

Opcode 001, Load: Load Clear
Function: Fetch current value and store zero.

Opcode 010, Load: Load Increment
Function: Fetch current value and increment storage.
Comment: 0xFFFF FFFF FFFF FFFF rolls over to 0. So when sw uses the counter as unsigned, +2^64 − 1 rolls over to 0. Thanks to two's complement, sw can use the counter as signed or unsigned. When using as signed, +2^63 − 1 rolls over to −2^63.

Opcode 011, Load: Load Decrement
Function: Fetch current value and decrement storage.
Comment: 0 rolls over to 0xFFFF FFFF FFFF FFFF. So when sw uses the counter as unsigned, 0 rolls over to +2^64 − 1. Thanks to two's complement, sw can use the counter as signed or unsigned. When using as signed, −2^63 rolls over to 2^63 − 1.

Opcode 100, Load: Load Increment Bounded
Function: The counter is the address given and the boundary is the SUBSEQUENT address. If counter and boundary values differ, increment counter and return old value, else return 0x8000 0000 0000 0000.
    if (*ptrCounter==*(ptrCounter+1)){
        return 0x8000 0000 0000 0000; // +2^63 unsigned
                                      // −2^63 signed
    } else {
        oldValue = *ptrCounter;
        ++*ptrCounter;
        return oldValue;
    }
Comment: The 8 B counter and its 8 B boundary efficiently support producer/consumer queue/stack/deque with multiple producers and multiple consumers. The counter and boundary pair must be within a 32Byte line. Rollover and signed/unsigned software use are as for ‘load increment’ instruction. On boundary, 0x8000 0000 0000 0000 is returned. So unsigned use is also restricted to the upper value 2^63 − 1, instead of the optimal 2^64 − 1. This factor 2 loss is not expected to be a problem in practice.

Opcode 101, Load: Load Decrement Bounded
Function: The counter is the address given and the boundary is the PREVIOUS address. If counter and boundary values differ, decrement counter and return old value, else return 0x8000 0000 0000 0000.
    if (*ptrCounter==*(ptrCounter−1)){
        return 0x8000 0000 0000 0000; // +2^63 unsigned
                                      // −2^63 signed
    } else {
        oldValue = *ptrCounter;
        --*ptrCounter;
        return oldValue;
    }
Comment: Comments as for ‘Load Increment Bounded’.

Opcode 110, Load: Load Increment if equal
Function: The counter is the address given and the compare value is the SUBSEQUENT address. If counter and boundary values are equal, increment counter and return old value, else return 0x8000 0000 0000 0000.
    if (*ptrCounter!=*(ptrCounter+1)){
        return 0x8000 0000 0000 0000; // +2^63 unsigned
                                      // −2^63 signed
    } else {
        oldValue = *ptrCounter;
        ++*ptrCounter;
        return oldValue;
    }
Comment: The 8 B counter and its 8 B compare value efficiently support trylock operations for mutex locks. The counter and boundary pair must be within a 32Byte line. Rollover and signed/unsigned software use are as for ‘load increment’ instruction. On mismatch, 0x8000 0000 0000 0000 is returned. So unsigned use is also restricted to the upper value 2^63 − 1, instead of the optimal 2^64 − 1. This factor 2 loss is not expected to be a problem in practice.

Opcode 000, Store: Store
Function: Store the given value.

Opcode 001, Store: StoreTwin
Function: Store 8 B value to 8 B address given and to the SUBSEQUENT 8 B address, if these two locations previously had the equal values.
Comment: Used for fast deque implementations. The address pair must be within a 32Byte line.

Opcode 010, Store: Store Add
Function: Add store value to storage.
Comment: 0xFFFF FFFF FFFF FFFF and earlier rolls over to 0 and beyond. Vice versa in the other direction. So when sw uses the counter as unsigned, +2^64 − 1 and earlier rolls over to 0 and beyond. Thanks to two's complement, sw can use the counter and ‘store value’ as signed or unsigned. When using as signed, and adding a positive store value, then +2^63 − 1 and earlier rolls over to −2^63 and beyond. Vice versa, when adding a negative store value.

Opcode 011, Store: Store Add/Coherence on Zero
Function: As Store Add, but do not keep L1-caches coherent unless storage value reaches zero.

Opcode 100, Store: Store Or
Function: Logical OR value to storage.

Opcode 101, Store: Store Xor
Function: Logical XOR value to storage.

Opcode 110, Store: Store Max Unsigned
Function: Store Max of value and storage, values are interpreted as unsigned binary.

Opcode 111, Store: Store Max Sign/Value
Function: Store Max of value and storage, values are interpreted as 1 b sign and 63 b absolute value.
Comment: Allows Max of floating point numbers. If the encoding of either operand represents a NaN, the operand is assumed to be positive for comparison purposes.
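The pseudocode in the table can be written as compilable C reference functions. This is a sequential sketch of the described return values only: the function names and the `ATOMIC_FAIL` constant name are assumptions, and plain C gives no atomicity, which in the described system comes from the blocked EDRAM pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value returned on boundary or mismatch, per the table. */
#define ATOMIC_FAIL 0x8000000000000000ull

/* Load Increment Bounded: boundary is the SUBSEQUENT 8-byte word. */
static uint64_t load_increment_bounded(uint64_t *ptrCounter) {
    assert(ptrCounter != NULL);
    if (ptrCounter[0] == ptrCounter[1]) {
        return ATOMIC_FAIL;
    }
    uint64_t oldValue = ptrCounter[0];
    ptrCounter[0] = oldValue + 1;
    return oldValue;
}

/* Load Decrement Bounded: boundary is the PREVIOUS 8-byte word. */
static uint64_t load_decrement_bounded(uint64_t *ptrCounter) {
    assert(ptrCounter != NULL);
    if (ptrCounter[0] == ptrCounter[-1]) {
        return ATOMIC_FAIL;
    }
    uint64_t oldValue = ptrCounter[0];
    ptrCounter[0] = oldValue - 1;
    return oldValue;
}

/* Load Increment if Equal: compare value is the SUBSEQUENT word. */
static uint64_t load_increment_if_equal(uint64_t *ptrCounter) {
    assert(ptrCounter != NULL);
    if (ptrCounter[0] != ptrCounter[1]) {
        return ATOMIC_FAIL;
    }
    uint64_t oldValue = ptrCounter[0];
    ptrCounter[0] = oldValue + 1;
    return oldValue;
}
```

As a usage example, a two-word pair `{0, 0}` behaves like a trylock under `load_increment_if_equal`: the first call returns 0 and bumps the counter to 1, and later calls return the failure value until the counter is reset.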

A load increment acts similarly to a load. This instruction provides a destination address to be loaded and incremented. In other words, the load gets a special modification that tells the memory subsystem not to simply load the value, but also increment it and write the incremented data back to the same location. This instruction is useful in various contexts. For instance if there is a workload to be distributed to multiple threads, and it is not known how many threads will share the workload or which one is ready, then the workload can be divided into chunks. A function can associate a respective integer value to each of these chunks. Threads can use load-increment to get a workload by number and process it.
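The chunk-distribution pattern can be sketched in portable C. Here C11 `atomic_fetch_add` is used as a software stand-in for the hardware load-increment, which is an assumption made for illustration; the names `load_increment` and `claim_chunk` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the load-increment instruction: returns the old value
 * and increments the stored value in one indivisible step. */
static uint64_t load_increment(_Atomic uint64_t *counter) {
    return atomic_fetch_add(counter, 1);
}

/* Each worker calls this to claim the next unprocessed chunk number.
 * Returns -1 once all n_chunks chunks have been handed out. */
static int64_t claim_chunk(_Atomic uint64_t *next_chunk, uint64_t n_chunks) {
    assert(n_chunks <= (uint64_t)INT64_MAX);
    uint64_t id = load_increment(next_chunk);
    return id < n_chunks ? (int64_t)id : -1;
}
```

Each thread loops on `claim_chunk` until it returns -1, so the number of participating threads does not need to be known in advance and no chunk is handed out twice.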

Each of these instructions acts like a modification of main memory. If any of the core/L1 units has a copy of the modified value, it will get a notification that the memory value has changed, and it evicts and invalidates its local copy. The next time the core/L1 unit needs the value, it has to fetch it from the L2. This process happens each time the location is modified in the L2.

A common pattern is that some of the core/L1 units will be programmed to act when a memory location modified by atomic instructions reaches a specific value. Polling for the value then causes repeated L1 misses and fetches from L2, followed by L1 invalidations due to atomic instructions.

Store_add_coherence_on_zero reduces how often the local cache is invalidated and a new copy fetched from the L2 cache. With this atomic instruction, L1 cache lines will be left incoherent and not invalidated unless the modified value reaches zero. The threads waiting for zero can then keep checking whatever local value their L1 cache holds, even if that local value is inaccurate, until the value is actually zero. This means that one thread might modify the value as far as the L2 is concerned, without generating a miss for other threads.
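The effect on a polling thread can be illustrated by counting L1 misses in a toy model with one L2 value and one L1 copy. The structure `coz_model_t` and the `model_*` functions are hypothetical and ignore timing, multiple cores, and cache line granularity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t l2_value;   /* authoritative value in L2 */
    uint64_t l1_copy;    /* possibly stale copy in the poller's L1 */
    bool     l1_valid;
    unsigned l1_misses;  /* fetches from L2 caused by polling */
} coz_model_t;

/* Polling read by a waiting thread: hits in L1 when the copy is valid. */
static uint64_t model_poll(coz_model_t *m) {
    assert(m != NULL);
    if (!m->l1_valid) {
        m->l1_copy = m->l2_value;
        m->l1_valid = true;
        m->l1_misses++;
    }
    return m->l1_copy;
}

/* Ordinary store add: every modification invalidates the L1 copy. */
static void model_store_add(coz_model_t *m, int64_t delta) {
    m->l2_value += (uint64_t)delta;
    m->l1_valid = false;
}

/* Store add with coherence on zero: invalidate only when the value
 * in L2 reaches zero. */
static void model_store_add_coz(coz_model_t *m, int64_t delta) {
    m->l2_value += (uint64_t)delta;
    if (m->l2_value == 0) {
        m->l1_valid = false;
    }
}
```

Counting a value down from 3 while polling after each step, the ordinary variant in this model costs one miss per poll, while the coherence-on-zero variant costs one miss for the first poll and one more when zero is reached; the intermediate polls read a stale value.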

In general, the instructions in the above table, called “atomic,” have an effect that the regular load and store do not have. They load, read, modify and write back in one atomic instruction, even within the context of speculation. This type of operation works in the context of speculation because of the loop back in the EDRAM pipeline. It executes conflict checking equivalent to a sequence of a load and a store. Before the atomic instruction loads, it performs the version aggregation discussed further in the provisional applications incorporated by reference above.

Although the embodiments of the present invention have been described in detail, it should be understood that various changes and substitutions can be made therein without departing from spirit and scope of the inventions as defined by the appended claims. Variations described for the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application need not be used for all applications. Also, not all limitations need be implemented in methods, systems and/or apparatus including one or more concepts of the present invention.

The present invention can be realized in hardware, software, or a combination of hardware and software. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and run, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods.

Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.

It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention is suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art.

The word “comprising”, “comprise”, or “comprises” as used herein should not be viewed as excluding additional elements. The singular article “a” or “an” as used herein should not be viewed as excluding a plurality of elements. Unless the word “or” is expressly limited to mean only a single item exclusive from other items in reference to a list of at least two items, then the use of “or” in such a list is to be interpreted as including (a) any single item in the list, (b) all of the items in the list, or (c) any combination of the items in the list. Ordinal terms in the claims, such as “first” and “second” are used for distinguishing elements and do not necessarily imply order of operation.

Items illustrated as boxes in flowcharts herein might be implemented as software or hardware as a matter of design choice by the skilled artisan. Software might include sequential or parallel code, including objects and/or modules. Modules might be organized so that functions from more than one conceptual box are spread across more than one module or so that more than one conceptual box is incorporated in a single module. Data and computer program code illustrated as residing on a medium might in fact be distributed over several media, or vice versa, as a matter of design choice. Such media might be of any suitable type, such as magnetic, electronic, solid state, or optical.

Any algorithms given herein can be implemented as computer program code and stored on a machine readable medium, to be performed on at least one processor. Alternatively, they may be implemented as hardware. They are not intended to be executed manually or mentally.

The use of variable names in describing operations or instructions in a computer does not preclude the use of other variable names for achieving the same function.

1. A multiprocessor system comprising: a plurality of processors adapted to carry out speculative execution in parallel; a conflict checking mechanism adapted to detect and protect results of speculative execution responsive to memory access requests from the processors; and an instruction implementation mechanism cooperating with the processors and conflict checking mechanism adapted to implement an atomic instruction that includes load, modify, and store with respect to a single memory location in an uninterruptible fashion.
2. The system of claim 1, comprising a cache memory, wherein the cache memory includes: a directory; storage locations; a directory lookup mechanism adapted to implement the conflict checking.
3. The system of claim 2, wherein the directory lookup mechanism is further adapted to detect the atomic instruction responsive to a memory access to a distinct address.
4. The system of claim 1, wherein the atomic instruction specifies a memory access relating to a cache line, the conflict checking mechanism is disposed within a cache memory unit containing the line, and the system comprises a cache memory including at least one queue along with a blocking mechanism for preventing accesses to the queue corresponding to the cache line.
5. The system of claim 1, wherein the instruction implementation mechanism comprises: a queue in a cache memory for queuing memory access requests; and a feedback loop for feeding back at least one later part of the atomic instruction after a first part has completed.
6. The system of claim 5, wherein the processors are further adapted to carry out at least one atomicity-related function other than the atomic instruction; and the feedback loop is adapted to override the function and give the atomic instruction priority.
7. The system of claim 6, wherein the other atomicity related function comprises a larx/stcx type pair of instructions.
8. The system of claim 6, wherein the other atomicity related function comprises a thread executing under a TM model.
9. The system of claim 1, wherein the atomic instruction comprises sub-operations including a read, an increment, and a write.
10. A system comprising: a plurality of processors adapted to issue atomicity related operations including at least one atomic instruction that includes sub-operations, the sub-operations including a read, a modify, and a write; at least one other type of atomicity related operation; at least one cache memory comprising: a cache data array access pipeline; and at least one controller adapted to prevent the other types operations from entering the cache data array access pipeline, responsive to the atomic instruction in the pipeline, when those other types of operation compete with the atomic instruction in the pipeline for a memory resource.
11. A multiprocessor system comprising: a plurality of processors adapted to implement parallel speculative execution of program threads, the processors being adapted to implement a plurality of atomicity related techniques; a central conflict checking mechanism adapted to resolve conflicts between the threads; and conflict resolution protocol apparatus adapted to prioritize at least one atomicity related technique over at least one other atomicity related technique.
12. A computer method comprising: issuing an atomic instruction from a processor in a multi-processor system, the atomic instruction defining sub-operations that include loading, modifying, and storing with respect to a memory resource; recognizing the atomic instruction in a directory based conflict checking mechanism; and blocking other functions that seek to access the memory resource, until the atomic instruction has completed.
13. The method of claim 12, wherein the directory based conflict checking system is in a cache memory.
14. The method of claim 13, comprising recognizing the atomic instruction as part of a directory lookup in the cache memory.
15. The method of claim 13, wherein the atomic instruction seeks to access a cache line, the cache memory includes at least one queue and blocking comprises preventing operations accessing the cache line from entering the queue.
16. The method of claim 12, comprising: issuing another atomicity related operation from the processor; and blocking the other atomicity related operation responsive to the atomic instruction.
17. The method of claim 16, wherein the other atomicity related operation comprises a TM section of a program.
18. The method of claim 16, wherein the other atomicity related operation comprises a larx/stcx pair.
19. The method of claim 12, comprising aggregating versions of a memory resource prior to undertaking the atomic instruction.
20. The method of claim 12, wherein blocking comprises preventing operations from entering a pipeline when there is an atomic instruction in the pipeline.
21. The method of claim 20, comprising feeding back at least one of the sub-operations into the pipeline, during blocking.
22. The method of claim 20, comprising concatenating a plurality of atomic instructions using a same memory resource in the pipeline during blocking.